&FOSR  68- 




AIRMAN  QUALIFYING  EXAMINATION-66 
ADMINISTERED  AS  A 
CONFIDENCE  TEST 


Emir H. Shuford Jr. and H. Edward Massengill Jr.




Within the past two decades, several developments have occurred which suggest
that it may be possible to make great improvements in ability, aptitude, and
achievement testing. Gains have been made in aptitude and achievement
testing, but during the past five or so decades of testing they have resulted
from efforts to improve the development and selection of test items for use
within the standard framework for objective and semi-objective testing
utilizing multiple-choice and completion-type item formats. The new
developments, however, call for a fundamental change in the way that a test is
administered. They allow for a more penetrating method of measurement which
moves ahead of the traditional choice response of the examinee to measure more
directly the value and character of the information that would lead
to a choice, and they measure it in terms of subjective probability of
correctness or, equivalently, degree of confidence.

Although there have been many attempts in the past to measure degree of
confidence, to allow partial knowledge, and to eliminate guessing (all of which
have met with notable lack of success), there are still reasons to believe
that it is possible to improve upon the choice method of test administration.

At first these reasons were logical and mathematical. After the fact it is
clear that the first requirement that any method of measuring an examinee's
confidence must meet is that it be in his best interest to honestly express
his degree of confidence. Toda (1963) in Japan, de Finetti (1961) in Italy,
van Naerssen (1961) in Holland, and Roby (1965) in the United States
independently discovered special ways of rewarding a subject according to his
assignment of confidence to the alternatives in a choice problem. These
scoring systems all had the special property that the subject could maximize his
expected score if and only if he honestly expressed his degree of confidence
in the alternatives. Shuford, Albert and Massengill (1966) further
rationalized these scoring systems and extended their use to the fill-in-the-blank
or completion-type item. It is significant that when the previous and
unsuccessful attempts to measure degree of confidence are examined as to the
mathematical properties of their scoring systems, none of them has been found
to use an admissible scoring system. In each case, the scoring system was
either so ill-defined that it constituted a projective test to the examinee,
or, if well-defined, the scoring system had the property that the examinee
would make a better test score by not responding with his actual degree of
confidence. Although this may not explain completely the reasons for failure
of the earlier studies, it certainly gives us reason not to be discouraged by
the long run of negative findings with respect to the promise of confidence
testing.
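
The admissibility property described above can be illustrated with a short
sketch. The report does not name a particular scoring formula at this point,
so the logarithmic rule below is an assumption, chosen because it is a
well-known rule under which an honest report maximizes the expected score.
The linear rule is included only as a contrast, and the belief values are
hypothetical.

```python
import math

def expected_log_score(belief, report):
    """Expected score under a logarithmic rule (assumed for illustration):
    the examinee earns log(report[i]) when alternative i is correct."""
    return sum(p * math.log(r) for p, r in zip(belief, report) if p > 0)

def expected_linear_score(belief, report):
    """Expected score under a linear rule: the examinee earns report[i]
    when alternative i is correct."""
    return sum(p * r for p, r in zip(belief, report))

# Hypothetical actual degrees of confidence over three alternatives.
belief = [0.6, 0.3, 0.1]
exaggerated = [0.9, 0.05, 0.05]   # overstates the favored alternative
all_on_one = [1.0, 0.0, 0.0]      # behaves like a plain choice response

# Logarithmic rule: the honest report has the higher expected score.
log_honest = expected_log_score(belief, belief)
log_exaggerated = expected_log_score(belief, exaggerated)

# Linear rule: placing everything on one answer beats the honest report,
# so this rule does not reward honesty.
linear_honest = expected_linear_score(belief, belief)
linear_all_on_one = expected_linear_score(belief, all_on_one)

print(f"log rule:    honest {log_honest:.3f}  exaggerated {log_exaggerated:.3f}")
print(f"linear rule: honest {linear_honest:.3f}  all-on-one {linear_all_on_one:.3f}")
```

Under the linear rule the examinee does better by overstating his confidence,
which is the kind of defect attributed above to the earlier scoring systems.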

Recognition of the new-found promise of confidence testing led to the
occurrence of two significant events. The Shuford-Massengill Corporation
undertook the development of training and response aids and other materials and
procedures which would make confidence testing feasible and suitable for
widespread use in the public school as an improved way of administering
teacher-made classroom tests and quizzes. The Advanced Research Projects Agency
of the Department of Defense contracted with the Corporation to develop the
decision-theoretic psychometrics necessary to guide and support applications
of confidence testing to military operations. This contract effort yielded
numerous reports (Massengill & Shuford, 1966; 1967; 1968; Shuford &
Massengill, 1966a; 1966b; 1967a; 1967b; 1967c) which spelled out the conditions
under which confidence testing could yield large gains in personnel
selection, classification, training, and education. In response to the promising
indications of this theoretical work, the Advanced Research Projects Agency
expanded the contract effort to include collecting and analyzing comparative
data on the performance of choice and confidence testing with military
selection and classification tests. This Semi-Annual Technical Report is devoted
to the analysis of the performance of confidence testing in the
administration of the multiple-choice items of the Airman Qualifying Examination-66,
which is currently used by the Air Force Recruiting Service for the selection
and classification of non-prior-service applicants to determine their
enlistment qualifications and aptitudes.

* The Fourth Semi-Annual Technical Report (which covers the period
November 1967 through April 1968) of work performed under contract
number AF 49(638)-1744, ARPA Order Number 833, by The Shuford-Massengill
Corporation, One Wallis Court, Lexington, Massachusetts 02173


PROCEDURE 

As a result of discussions with Robert B. Stephens of Hq. USAF and Bart M.
Vitola of the Personnel Research Laboratory, we had planned to readminister
the multiple-choice portion of AQE-66 to about 300 basic airmen in training
at Lackland Air Force Base. Each basic airman devotes half a day to
participating in the experimental testing program of the Personnel Research
Laboratory. The airman has taken the AQE-66 at a recruiting station or at a
high school prior to entering service with the Air Force. By retrieving the
original test answer sheets of the airmen, we could compare the data from the
original administration of AQE-66 as a choice test with the data obtained from
the readministration as a confidence test. This comparison could apply only
to the 150 multiple-choice items of the AQE-66. The other test items are
speeded computational problems for which confidence testing would not be
appropriate. All of the 150 multiple-choice items are five-alternative items,
with the exception of several three- and four-alternative items.

Upon examining the materials for confidence testing, however, the staff of
the Personnel Research Laboratory expressed reservations as to the
feasibility of the method, since the airmen would represent a broad cross section
of educational levels and academic backgrounds. If the airmen failed to
clearly understand the rationale and procedures of confidence testing, then
the resulting confusion would get in the way of their responding in a
meaningful fashion and disrupt the testing process to the extent that the
possible gains from confidence testing would not be realized. Realizing that we
would have only a three to three and a half hour period to train the airmen in
how to take a confidence test and to have them complete 150 five-alternative
items, many of them of considerable complexity and difficulty, we found this
argument somewhat compelling. We proposed, therefore, that instead of
risking a larger-scale program using 300 airmen we scale down the first testing
to include two small groups, each of which would represent a cross section of
airmen currently in training at Lackland Air Force Base. This plan was agreed
to by the Personnel Research Laboratory and the experimental testing took place
on January 26, 1968.


Test booklets and testing facilities and personnel were provided by the
Personnel Research Laboratory. Mr. C. L. Cannon and Sgt. I. T. Busby of the
Personnel Research Laboratory assisted in the administration of the test.
Thirty airmen were tested in the morning and 32 were tested in the afternoon.
Edward Massengill of the Corporation spent approximately one hour with each
group instructing them how to take a confidence test. The airmen were then
given a fifteen-minute break and returned to take the 150 multiple-choice
items of the AQE-66.

The airmen were allowed one hour and forty-five minutes to take the test.
This is the time normally allowed for taking this portion of AQE-66. Some
of the airmen completed all 150 items well ahead of the time limit, while
others took the full hour and forty-five minutes and completed less than the
full set of items. In the original administration of this test, many of the
airmen also failed to complete all 150 items. The major difference between
the two administrations is that in the original administration the airmen
who failed to complete the test tended to skip many items, whereas in the
readministration airmen tended to answer all items consecutively up to the
point where the time limit was imposed.

All in all, this test was speeded for many of the airmen both in the original
administration and even more so when it was administered as a confidence test.
The mechanics of taking a confidence test require more manipulation and
writing on the part of the examinee than in the case of a choice test. Further,
experience indicates that confidence testing leads the examinee to think much
more carefully about each test question and its answers than in the case of
choice testing. When a confidence test is speeded, there is less time to
consider carefully all the relevant information pertaining to a test question
and to carefully evaluate this information in terms of how much confidence is
justified in each of the possible answers. In addition, the easiest response
pattern to develop in a confidence test is that in which all the confidence
is placed on one of the answers and no confidence is placed on the remaining
answers. These considerations imply that speeding a confidence test will
make the data look more like that of a choice test where, in effect, all
confidence is placed on one and only one of the possible answers.

Each airman used the SCoRule response aid to develop his response to the
test items and then copied the appropriate letters into the corresponding
answer boxes (one letter for each possible answer) on a standard answer sheet
bound in a test booklet.

When an airman finished the test, he noted the time on the front cover of
the test booklet, handed it in, and left the room. Upon scanning the test
booklets, it was discovered that all except one airman had cooperated in
taking the test, leaving 30 airmen in the morning group and 31 airmen in the
afternoon group. The test was then scored by Mr. J. E. Wilbourn and Mr. C. L.
Cannon of the Personnel Research Laboratory and the resulting data was
forwarded to The Shuford-Massengill Corporation for further analysis.


FUNDAMENTAL  ANALYSIS  OF  THE  DATA 

The airmen went through the motions of taking a confidence test. They used
the SCoRule to develop their answers and then they copied down the letters
from the answer boxes, thus indicating various degrees of confidence in the
different answers. Is there any meaning to the data that was produced?

What does it mean when an airman writes down the letter A for an answer,
implying that he has zero confidence that the answer is, in fact, the correct
answer to the test item? What does it mean when he writes down a Z, saying
that he has complete confidence that the answer is the correct answer to the
test item? What does it mean when he writes down an M, implying that he has
a confidence of .48, and so on for all the letters of the alphabet?
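
The three values quoted here (A for zero, M for .48, Z for complete
confidence) are consistent with 26 equally spaced steps, since M is the
thirteenth letter and 12/25 = .48. The sketch below assumes that linear
scale. The function name is hypothetical, and the actual markings of the
response aid may differ.

```python
import string

def letter_to_confidence(letter):
    """Map a response letter A..Z to a degree of confidence in [0, 1],
    assuming 26 equally spaced steps (A = 0, Z = 1). This scale is inferred
    from the three values quoted in the text, not taken from a specification."""
    index = string.ascii_uppercase.index(letter.upper())
    return index / 25

for letter in "AMZ":
    print(letter, letter_to_confidence(letter))
```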


RESPONSE  VALIDITY 

One way of evaluating the meaning of this data is to examine for each letter
of the alphabet the frequency with which it was used on a correct answer
relative to the total frequency with which the letter was used. One would hope
that the more confidence that an airman places on an answer, the more likely
it is to be the correct answer. This does not necessarily have to be the
case, however. An airman could use a random-like process for setting up the
SCoRule and still produce a reasonable answer sheet with all sorts of
different degrees of confidence indicated. If this process were totally at random,
within the constraints of the testing situation, then the expected relative
frequency of an answer being correct would be about 20% and this would be so
regardless of the particular confidence placed on the answer.

About  20%  of  those  answers  which  have  been  assigned  zero  confidence  (A)  would 
be  correct;  20%  of  those  answers  which  have  been  assigned  complete  confidence 
(Z)  would  be  correct;  and  so  on  for  all  possible  degrees  of  confidence.  In 
such  a  case,  the  data  would  have  no  meaning. 

Tables 1a, b, c, and d show these relative frequencies as a function of degree
of confidence (indicated by the alphabet from A through Z) for each of the 61
airmen. For example, look at airman number 1 in Table 1a. Four hundred and
thirty times he assigned A to an answer and 12 of those times he placed the
A on a correct answer. The relative frequency of an answer to which this
airman assigned zero confidence being correct is 12 divided by 430 or about
.04. This airman placed an M on 21 of the answers and 9 of these were, in
fact, correct answers, giving a relative frequency of about .43. And finally,
out of the 98 times that he placed Z on an answer, 87 of these answers were,
in fact, correct, yielding a relative frequency of about .89. Examination of
the data in Tables 1a, b, c, and d indicates that no airman followed such a
random response strategy.
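
The tabulation described above amounts to one division per letter. A minimal
sketch follows, using only the M and Z counts quoted for airman number 1. The
function and variable names are hypothetical, and the full tables are not
reproduced here.

```python
def relative_frequency_correct(times_correct, times_used):
    """Fraction of the answers marked with a given letter that were correct.
    Returns None when the letter was never used."""
    if times_used == 0:
        return None
    return times_correct / times_used

# (times on a correct answer, times used), as quoted for airman number 1.
airman_1 = {"M": (9, 21), "Z": (87, 98)}

rates = {
    letter: relative_frequency_correct(correct, used)
    for letter, (correct, used) in airman_1.items()
}

for letter, rate in rates.items():
    print(f"{letter}: {rate:.2f}")
```

A purely random responder would be expected to show a rate near .20 for every
letter on five-alternative items, so rates that rise with the letter are
evidence against that strategy.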

Table 2 shows the same data summed over all 61 airmen. The frequency with
which the different degrees of confidence are used is a function of the
difficulty level of the test. The relative frequency with which answers
assigned given degrees of confidence are correct is not a function of the
difficulty level of the test. Notice that, in general, this relative frequency
increases for the higher degrees of confidence. This is shown more clearly in
Figure 1 of the Appendix. There is clearly a functional relation between the
degree of confidence assigned to an answer and the relative frequency with
which it is, in fact, the correct answer. For the group as a whole, the more
confidence an airman places on an answer, the more likely it is to be the
correct answer. In this sense, the responses of the group as a whole will
certainly have meaning.

Does a similar relation hold for the data of each airman? The data of Tables
1a, b, c, and d suggest that maybe it does, but the variation due to the small
number of observations for many of the degrees of confidence makes it hard to
see. A clearer picture may be obtained if one just looks at the relative
frequency of an answer being correct when it was assigned an A and when it
was assigned a Z, since these are by far the most frequently occurring
confidence assignments. These are shown for each of the 61 airmen in Table 3. In
every instance, the Percent "Z" Answers Correct is larger than the Percent
"A" Answers Correct. This suggests that each airman understood and did assign
more confidence to the answers that he thought were correct. Thus, the data
admits of a simple interpretation that the more confidence an airman assigns
to an answer, the more likely it is to be correct.


GAIN  IN  INFORMATION 

This alone, however, does not prove that one is getting completely
meaningful data from confidence testing. To see this, one can consider a "blind"
or "stupid" process which can yield a monotonic increasing relation (in fact
it is a linear relation) between relative frequency and degree of confidence.
Suppose that you were given the answer sheet of an airman who had taken AQE-66
as a choice test and all that has been indicated on the answer sheet is the
number of the test item and which of the five answers the airman chose. You
take this answer sheet and use it to fill out another answer sheet of the type
used for confidence testing. Remember that all the items have five possible
answers and that you must put down a letter indicating a degree of confidence
for each of these answers. Say that on the first test item you look on the
choice answer sheet to see which answer the airman has chosen and put a Z in
this answer box, indicating that the airman had complete confidence in that
answer. Now, since all the confidence has been placed on that one answer,
there is no confidence left over for the other answers, and an A must be
placed in each of the remaining four answer boxes. You may continue doing
this for a few more items but then change to a strategy of taking the answer
chosen by the airman and giving it a confidence of about 1/2 (say either an
M or an N) and giving one of the other answers a confidence of about 1/2.
This leaves no confidence for the remaining three answers so give them all
A's. Follow this strategy for a while and then change to giving a confidence
of about 1/3 to the answer chosen by the airman and a confidence of about 1/3
to each of two of the other answers, leaving no confidence left over for the
other two of the five answers. Now after a while, change to a strategy where
you split the confidence four ways and then finally change to a strategy of
assigning equal confidence to all of the five answers. Now you can vary back
and forth between these strategies and mix them up any way you wish to get a
reasonable appearing pseudo-confidence test answer sheet. You have not used
any information that was not contained on the original choice answer sheet.
You know nothing whatsoever about the test item, or whether or not the
airman's answer was the correct one.




Now, what would this pseudo-confidence test data look like if we analyzed it
as before? Suppose that the airman had chosen the correct answer 60% of the
time in the original choice test. Now the pseudo-confidence test would show
that about 60% of the Z answers were correct. Each of the four answers the
airman did not choose would be correct about 10% of the time, since the
remaining 40% is divided evenly among them. For those items where the
confidence was split between two answers, the percent of M answers correct would
be 1/2 of 60 plus 10, that is, 1/2 of 70 or 35%. For those items where the
confidence was split between three answers, the percent of I answers correct
would be 1/3 of 60 plus 10 plus 10, that is, 1/3 of 80 or 26-2/3%. For those
items in which the confidence was divided among four answers, the percent of
G answers correct would be 1/4 of 60 plus 10 plus 10 plus 10, that is, 1/4 of
90 or 22-1/2%, while for those items in which the confidence was divided among
all five answers, the percent of F answers correct would, of course, be about
20%. Then finally, the percent of A answers correct would be about 10%. If
this relation were plotted on a graph, the expected values would fall on a
straight line and, of course, the observed values would vary randomly around
this straight line.
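
The arithmetic of this base-level relation can be reproduced in a few lines.
The sketch below assumes, as in the thought experiment, that the answers
sharing the confidence with the airman's choice are picked without any
knowledge of the item, so each non-chosen answer is equally likely to be the
correct one. The function names are hypothetical.

```python
from fractions import Fraction

def pseudo_confidence_baseline(p_choice, k, n_alternatives=5):
    """Expected fraction correct among answers given confidence 1/k when the
    confidence is split evenly over the chosen answer and k - 1 others."""
    p = Fraction(p_choice)
    q = (1 - p) / (n_alternatives - 1)   # chance that any one non-chosen answer is correct
    return (p + (k - 1) * q) / k

def zero_confidence_baseline(p_choice, n_alternatives=5):
    """Expected fraction correct among answers given zero confidence."""
    return (1 - Fraction(p_choice)) / (n_alternatives - 1)

p = Fraction(3, 5)   # the 60% example used in the text
baseline = {k: pseudo_confidence_baseline(p, k) for k in range(1, 6)}

for k, value in baseline.items():
    print(f"confidence 1/{k}: {float(value):.4f}")
print(f"confidence 0:   {float(zero_confidence_baseline(p)):.4f}")
```

Written as a function of the confidence c = 1/k, these values equal
q + (p - q) * c with q = (1 - p)/4, which is the straight line referred to in
the text.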

The main thing, however, is that there would be a functional relation between
relative frequency and degree of confidence, but this relation is certainly
not one that we would be happy with. In generating the data that yielded this
relation, you used no information other than that contained in the choice
responses of the airman. Therefore, the pseudo-confidence responses can reflect
no valid information other than that contained in the choice responses. In
order for confidence responses to contain valid information over and above that
yielded by choice tests, the relation between relative frequency and degree of
confidence must be steeper than this base-level relation, which depends upon
the percentage of correct answers in the choice administration of the test.

Therefore, in order to determine whether or not any additional information has
been obtained from the confidence measurement administration of AQE-66, the
percent of Z answers correct must be compared with the percent of correct
answers that would have been obtained if the test had been administered as a
choice test. A confidence test can always be scored as a choice test if you
are willing to make the assumption that the examinee would have chosen that
answer in which he indicated the greatest degree of confidence. In the event
that two or more answers are approximately tied for maximum degree of
confidence, then the tie can be broken by using a table of random numbers. This
was exactly the procedure used to obtain the inferred percent of correct
answers shown in Table 3. Notice that in every case the percent of Z answers
correct exceeds the inferred percent correct answers. These comparisons are
also shown graphically in Figure 1* of the Appendix. It is evident that there
are great individual differences in these data. These differences in part
reflect an airman's ability to evaluate information in terms of how much
confidence is justified by the information at hand. For example, airman number
26, as shown by the data of Table 1b and Table 3, does not know what he knows
and what he doesn't know. He sees things in terms of black and white and his
data is very much like that yielded by a choice test. In fact, it appears that
the worst that confidence measurement can do is to yield data like that from a
choice test. Airman 25, on the other hand, is exceptionally good at
evaluating information. He evidenced many degrees of confidence on different test
items and his percent Z answers correct is about 92 and his inferred percent
correct answers is about 55.
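
The rule for scoring a confidence test as a choice test can be sketched as
follows. A pseudo-random generator stands in for the table of random numbers,
and the tolerance that decides when answers count as "approximately tied" is
an assumption, since the report does not give one. All names and the small
answer sheet are hypothetical.

```python
import random

def inferred_choice(confidences, rng, tolerance=0.05):
    """Index of the answer the examinee is assumed to have chosen: the one
    with the greatest confidence, with near-ties broken at random."""
    top = max(confidences)
    tied = [i for i, c in enumerate(confidences) if top - c <= tolerance]
    return rng.choice(tied)

def inferred_percent_correct(answer_sheet, key, rng, tolerance=0.05):
    """Percent of items on which the inferred choice matches the key."""
    hits = sum(
        inferred_choice(confidences, rng, tolerance) == correct
        for confidences, correct in zip(answer_sheet, key)
    )
    return 100.0 * hits / len(key)

# Hypothetical three-item answer sheet: one row of confidences per item.
sheet = [
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.20, 0.80, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 1.00],
]
key = [0, 2, 1]   # index of the correct answer for each item

rng = random.Random(1968)   # fixed seed so the tie-breaking is repeatable
print(inferred_percent_correct(sheet, key, rng))
```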




DISRUPTION  OF  THE  TESTING  PROCESS 

So far the analysis of response validity has been internal to the
experimental readministration of AQE-66 as a confidence test. By taking account of
information external to this test administration we can evaluate another
hypothesis about confidence testing. As mentioned before, there has been some
concern that the procedures of confidence testing are so complex as to
confuse some of the airmen and thus lower their test performance. Confidence
testing would so interfere with the test-taking process that a distorted
picture would be obtained of the airman's ability level. This is certainly a
conceivable outcome and if it happened, we should find that the Percent "Z"
Answers Correct computed for some airmen were actually lower than the Percent
Correct Answers yielded by the same airmen during an independent
administration of the same test as a choice test. We did not find this happening with
respect to the Inferred Percent Correct Answers, which is one indicant of the
score which a student would have obtained if he had simultaneously been
given a choice test. This is no real test of the hypothesis, however, since
the Inferred Percent Correct Answers is derived from the confidence test data
and if it were distorted by lack of understanding of the student, then the
distortion should occur both in the Percent "Z" Answers Correct and in the
Inferred Percent Correct Answers.

As mentioned above, these airmen had already taken AQE-66 prior to entering
service with the Air Force and, thus, their original test performance could
provide an independent but somewhat remote check on the extent to which the
airmen were confused by confidence testing. We were able to obtain the original
test data of only 40 out of the 61 airmen. The percent of correct answers
given by these 40 airmen during the original administration of AQE-66 is also
shown in Table 3. In every case but three, the airmen's Percent "Z" Answers
Correct is greater than the Original Percent Correct Answers. The three
exceptions are airmen number 7, 10, and 56, with the greatest deviation being
represented by airman number 10. When this airman took AQE-66 for the first
time and as a choice test he got 70% of the answers correct. When he took it
over again as a confidence test, about 60% of the answers in which he had
complete confidence were, in fact, correct answers. This is a difference of
about .10. It is possible that this airman was confused by the procedures of
confidence testing, but it should be noted that his Inferred Percent
of Correct Answers was about 53%, which is somewhat below the 60% "Z" Answers
Correct, indicating that his test data yielded information over and above that
which would have been obtained from a choice administration. On the other
hand, airman number 56 got about 73% correct answers in the original
administration of AQE-66, but we infer that he would have gotten about 45% correct
in the readministration of the test. This is a considerable drop. It should
be noticed that he did get about 70% correct when he said he had complete
confidence in an answer. This indicates a considerable gain for airman number 56
in the information obtained from confidence testing over that obtained from
choice testing. All in all, there is very little evidence that the procedures
of confidence testing interfered with the test-taking performance of these
airmen. There is no doubt that it can happen, but with careful instruction
in the procedures of confidence testing the relative frequency of occurrence
of confusion in examinees should be reduced to near zero.

It is unfair to an examinee if he fails to clearly understand instructions for
taking a test and if he fails to adopt a test-taking strategy which maximizes
his expected test score for the amount of knowledge that he possesses, and this
would be the case if an airman were confused by the procedures of confidence
testing. It is also the case, however, if an airman fails to respond to all
the items in the choice administration of AQE-66, since his expected test score
is maximized if and only if he responds to all items even to the extent of
guessing in those situations where he does not know the answer. This happens
and has been shown to have a major effect on the fairness of a choice test
(Shuford and Massengill, 1963).


THE EXISTENCE OF GUESSING

The  possibility  of  guessing  in  multiple- choice  tests  has  been  recognized  as  a 
problem. In any highly developed and perfected test such as AQE-66, several
techniques have been used to minimize the existence of guessing. First, the
decision to use five-alternative multiple-choice items is intended to minimize
the effect of guessing on the test results. Second, a major goal in the writing
of test items and the possible answers is to write them in such a way that
when a person doesn't know the correct answer, he will be almost certain to
pick out one of the misleads. In the language of confidence testing the goal is
to write test items so that an examinee either has a very high degree of
confidence in a correct answer or a very high degree of confidence in one of the
incorrect answers. If this goal were achieved, it would certainly minimize the
effect of guessing, since guessing occurs only when the examinee is uncertain
between the correct answer and one or more of the incorrect answers. For
example, an airman would be in a guessing situation if he assigned a confidence
of about 1/2 to the correct answer and a confidence of 1/2 to one of the
incorrect answers. He would have enough partial information to rule out three
of the incorrect answers but he would still be undecided between a correct
answer and one of the incorrect answers. If he flipped a "mental coin" to decide
between these two answers his probability of getting the right answer would be
1/2. On other items, the airman could be undecided among three of the five
answers, having enough partial information to rule out two of the incorrect
answers and a probability of chance success of 1/3; undecided among four of the
five, having enough partial information to rule out one of the incorrect answers
and a probability of chance success of 1/4; and finally, he could be totally
uninformed and have equal confidence in all five answers, with a probability
of chance success of 1/5.
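
The chance-success figures above follow one rule: an examinee who is evenly
undecided among k remaining answers, one of them correct, succeeds with
probability 1/k. A minimal Python sketch, assuming a five-alternative item and
equal confidence on every answer not ruled out (the function name is
illustrative and does not come from the report):

```python
from fractions import Fraction


def chance_success(alternatives: int, ruled_out: int) -> Fraction:
    """Probability of a correct pick by a 'mental coin' flip.

    Assumes the examinee has correctly ruled out `ruled_out` incorrect
    answers and is equally confident in each answer that remains.
    """
    remaining = alternatives - ruled_out
    if ruled_out < 0 or remaining < 1:
        raise ValueError("at least the correct answer must remain")
    return Fraction(1, remaining)


# Five-alternative item: rule out 3, 2, 1, or 0 of the incorrect answers.
table = {k: chance_success(5, k) for k in (3, 2, 1, 0)}
```

Ruling out all four incorrect answers leaves a single answer and a success
probability of 1, which is the case of full knowledge rather than guessing.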

Now the Airman Qualifying Examination has undergone a great deal of development
and refinement over the years. Empirical large-scale item analysis procedures
have been used to select the items used in AQE-66 from large pools of available
items. In this sense then, AQE-66 represents a highly refined test where a
great deal of effort has been devoted to eliminating the effect of guessing on
the test results. In spite of this, the confidence responses of the airmen
indicated that they were encountering guessing situations on about 1/4 of the
items. This is a major and highly promising finding because every time an
examinee encounters a guessing situation in taking a test, his response
contributes error variance to the test results, and this error variance, of
course, reduces the reliability, validity, and efficiency of the test data
(Shuford and Massengill, 1967b). In confidence testing, however, these guessing
situations are discriminated from other states of knowledge and thus do
not add to the error variance of the test results. Thus, to the extent that
guessing is a factor in choice testing, there is corresponding room for
improvement by administering the test as a confidence test rather than a choice
test. Since guessing situations are encountered by the airmen taking AQE-66,
let's see what effect this has on derived measures such as total scores and
aptitude scores for AQE-66.


PSYCHOMETRIC ANALYSIS OF CHOICE AND CONFIDENCE DATA


AQE-66 is made up of ten subtests (Vitola and Madden, 1967). Nine of the
subtests are composed of multiple-choice items and together make up the 150
multiple-choice items analyzed in the previous section. The tenth subtest is
composed of 60 computational items and is not included in this analysis. The
scores from the subtests are added together in various ways to derive four
different aptitude scores called General, Administrative, Mechanical, and
Electronics. Table 4 shows the number of items that make up each one of these
aptitude scores. There is, of course, some overlap between the aptitude scores
in the sense that some item scores enter into more than one aptitude score.

Table 4 also shows the corresponding means and standard deviations of the
aptitude and total test scores for the original administration of AQE-66 and
for the experimental readministration, both for the Valid Confidence score and
for the total amount of confidence assigned to the correct answers. Notice
that the mean confidence scores are higher than the other mean scores. This
reflects the nonlinear characteristic of the admissible scoring system used in
Valid Confidence Testing. Notice also that the standard deviations for the
Valid Confidence scores are smaller than in the other cases. This again is a
reflection of the logarithmic admissible scoring system. The scores of the
poorest performing airmen are being raised considerably and the whole
distribution of scores is being compressed toward the upper end of the scale. In
this sense then, AQE-66 is probably too "easy" for optimal performance as a
confidence test.
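
What makes a logarithmic scoring system "admissible" can be illustrated
numerically: the examinee's expected score is largest when the confidences he
reports equal the probabilities he actually holds. The Python sketch below uses
a bare natural logarithm as a stand-in. The scaling constants of the actual
Valid Confidence scoring are not given in this section and are not reproduced,
and the belief and report vectors are invented for illustration.

```python
import math


def log_score(confidence_on_correct: float) -> float:
    """Bare logarithmic score: log of the confidence placed on the correct answer.

    Stand-in only; the real Valid Confidence scoring constants are assumed
    unknown here.
    """
    return math.log(confidence_on_correct)


def expected_score(belief, report):
    """Expected score when `belief` is what the examinee really thinks and
    `report` is what he writes down (both are probabilities over the answers)."""
    return sum(b * log_score(r) for b, r in zip(belief, report))


# Hypothetical five-alternative item.
belief = [0.60, 0.20, 0.10, 0.05, 0.05]
reports = {
    "honest": belief,
    "overconfident": [0.90, 0.025, 0.025, 0.025, 0.025],
    "noncommittal": [0.20] * 5,
}
scores = {name: expected_score(belief, rep) for name, rep in reports.items()}
best = max(scores, key=scores.get)  # the honest report wins
```

Because the logarithm is steep near zero and flat near one, large confidence
misplaced on a wrong answer costs far more than the same confidence placed
correctly earns, which is the nonlinearity referred to above.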




RELIABILITY 

Both theory (Shuford and Massengill, 1966b; 1967b) and intuition suggest that
eliminating error variance due to guessing will increase the reliability of a
test. Table 5 shows the published reliabilities (Vitola and Madden, 1967) for
the four aptitude indexes of AQE-66. These published reliabilities include
the score from the Arithmetic Computation subtest, which enters only into the
Administrative Aptitude score. Thus, the published Administrative reliability
is not exactly comparable to the reliabilities we compute. Table 5 also shows
the reliability indexes computed on the choice data of the 40 airmen for whom
we could obtain the original answer sheets. Allowing for variation due to
sampling, these reliabilities seem to be in line with the published figures.
Notice that the reliability index for all 150 items is .954. AQE-66 is, quite
obviously, an exceptionally reliable test.

Table 5 shows also the reliability indexes computed for the Valid Confidence
scores of the 40 airmen. Reliabilities of the Valid Confidence scores are
higher than the reliabilities of the original choice scores, but not much
higher. As mentioned above, the variances of the Valid Confidence scores
are considerably smaller than those of the choice scores. Ordinarily such
a reduction in variance results in a reduction in the size of a correlation
coefficient such as the reliability index. This did not happen in this case.
The variances got smaller but the correlation coefficient increased in size.

We can get a better picture of the gain in reliability obtained from confidence
testing if we look at the total amount of confidence assigned to correct
answers. This measure yields score distributions much more comparable both in
terms of means and variances to the distributions of the choice scores from
the original administration. The reliability indexes for this measure are
also shown in Table 5 and in every case exceed all other reliability indexes
in size.

Although these differences in reliability may appear to be trivial, they are
not. The correlation coefficient is not a linear measure of testing efficiency.
As the correlation coefficient approaches 1, smaller and smaller differences
become more important.

One way to evaluate the importance of the gain in reliability resulting from
changing from choice administration to confidence administration of AQE-66 is
in terms of test length. Consider for example how many additional items would
be required to make the choice test as reliable as the confidence test according
to the total amount of confidence assigned to the correct answers. Table
6 gives the answer as derived from the Spearman-Brown Prophecy Formula. From
37 up to 56 additional items would have to be added to parts of AQE-66 to make
each aptitude score as reliable as that obtained from giving the current AQE-66
as a confidence test. For the test as a whole, 121 items would need to be added
to AQE-66. In a sense then, administering AQE-66 as a confidence test has an
effect, in terms of reliability, equivalent to considerably increasing the
number of items in the test.

Another way of looking at the relative efficiency of choice and confidence
testing is to consider how many items can be eliminated from the confidence
version of AQE-66 to reduce its reliability to that of the choice version of
AQE-66. The answer, also derived from the Spearman-Brown Prophecy Formula, is
shown in Table 6. From 20 to 29 items can be eliminated from those parts making
up each aptitude score, and for the test as a whole, 66 items could be
eliminated. In other words, an AQE-66 consisting of only 84 items administered
as a confidence test would be just as reliable as the present AQE-66 of 150
items administered as a choice test. Savings indicated in Table 6 are probably
underestimates of what actually can be achieved in practice since the
projections are based upon the assumption of blind random choice of items to be
omitted. If item analysis information were used to make an optimal selection of
items for inclusion in the test, then it should be possible to make the test
even shorter than indicated.
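
The test-length comparisons above rest on the Spearman-Brown Prophecy Formula,
which gives the reliability of a test lengthened (or shortened) by a factor n
as n*r / (1 + (n - 1)*r). Solving for n gives the length factor needed to reach
a target reliability. The Python sketch below shows both directions. The choice
reliability of .954 for 150 items is taken from the text; the confidence
reliability of .974 is a hypothetical value chosen for illustration and is not
read from Table 5.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Reliability of a test whose length is multiplied by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """Factor by which test length must change to move `reliability` to `target`."""
    return target * (1 - reliability) / (reliability * (1 - target))


n_items = 150
choice_r = 0.954       # reported in the text for all 150 items
confidence_r = 0.974   # HYPOTHETICAL illustration value, not from Table 5

# Items to add to the choice test to match the confidence test.
extra_items = n_items * (length_factor_needed(choice_r, confidence_r) - 1)

# Items that could be dropped from the confidence test to match the choice test.
removable_items = n_items * (1 - length_factor_needed(confidence_r, choice_r))
```

The two length factors are reciprocals of each other, which is why the number
of items to be added and the number that can be removed differ even though both
describe the same gain in reliability.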


INTERCORRELATION  BETWEEN  APTITUDE  SCORES 

Eliminating error variance due to guessing from AQE-66 should not only improve
the reliability of the test, but should, at least under some conditions,
increase the correlation between the aptitude scores. While increasing
reliability is a desirable effect, increasing intrabattery correlations reduces
the ability of the test to make differential predictions. Given whatever "true"
correlation there might be between the aptitude scores, the introduction of
random error due to guessing would serve to lower the computed correlations
but could not improve differential prediction. If this were happening, then
eliminating the random error due to guessing would allow the higher "true"
correlations to manifest themselves.
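
The argument that random error lowers computed correlations corresponds to the
classical attenuation relation from test theory, in which the observed
correlation equals the "true" correlation multiplied by the square root of the
product of the two reliabilities. The report does not state this formula, so
the Python sketch below is an illustration under that standard assumption, and
all of its numbers are hypothetical.

```python
import math


def attenuated_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Observed correlation implied by a true correlation and two reliabilities
    (classical test theory attenuation relation)."""
    return true_r * math.sqrt(rel_x * rel_y)


true_r = 0.80  # HYPOTHETICAL true correlation between two aptitude scores

# Hypothetical reliabilities with more and with less guessing error.
with_guessing = attenuated_correlation(true_r, 0.85, 0.85)
less_guessing = attenuated_correlation(true_r, 0.92, 0.92)
```

Raising the reliabilities moves the observed correlation toward the true value
but never past it, which matches the point that the true correlations are
revealed rather than created.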

Table 7 shows the correlation between pairs of aptitude scores for AQE-66.
The first column shows the published figures while the second column shows
the correlations computed from the original choice test data of our 40 airmen.
The greatest deviation between the published figures and those obtained
from the small sample of 40 is for the correlation between the Mechanical and
Electronics Aptitude scores, where the correlation based upon the small sample
is quite a bit less than that in the large-sample published data. It should
be noted that three of the six correlations are not exactly comparable to the
published data because the published data include the Arithmetic Computation
subtest as a component of the Administrative Aptitude score.

Table  7  also  shows  the  correlations  based  on  Valid  Confidence  scores  and  the 
total  amount  of  confidence  assigned  to  correct  answers  computed  from  the  data 
of  the  40  airmen.  In  every  instance  these  correlations  are  at  least  as  large 
as  the  correlations  computed  from  the  original  choice  data  of  the  same  40 
airmen.  Figures  2,  3,  and  4  of  the  Appendix  graphically  display  the  relations 
between  these  pairs  of  aptitude  scores. 


PREDICTIVE  VALIDITY  OF  CHOICE  AND  CONFIDENCE  DATA 

The increased reliability and intrabattery correlations obtained in the
confidence administration of AQE-66 could be due, in whole or in part, to a
reduction in the error variance that guessing produces when AQE-66 is
administered as a choice test. The increase could also be produced, in part, by
confidence testing introducing new variance which is not present when AQE-66 is
administered as a choice test. This must happen to a certain extent. Remember
that there are wide individual differences in the airmen's ability to evaluate
information. This is one reason why confidence testing yields a different
rank order of examinees, and it would most certainly be operating pretty much
throughout the test and affecting each aptitude score. Although there are
reasons for counting this as true variance and believing that it should
be measured and reflected in any test score, an unequivocal answer cannot be
obtained without further research and validation studies.

There is, however, one bit of data that can be used to get some hint at the
ability of confidence testing to improve the validity of AQE-66. The records
of the airmen used in this study contain their Air Force Qualifying Test scores
(AFQT). This is, like AQE-66, a multiple-choice aptitude test. If these Air
Force Qualifying Test scores were considered as a criterion variable, the
correlation between AQE-66 and AFQT would indicate the validity of AQE-66
considered as a predictor of AFQT. Figure 5 in the Appendix shows the relation
between AQE-66 administered as a choice test and AFQT for the 40 airmen. The
correlation between these two sets of scores is .76.




Figure 6 in the Appendix shows the relation between the Valid Confidence score
for AQE-66 administered as a confidence test and AFQT. The reduction in
variance produced by the logarithmic admissible scoring system is quite apparent
here. The correlation for these two sets of scores is .70.

When we attempt to equalize variances, however, by looking at the total amount
of confidence on correct answers obtained from AQE-66 administered as a
confidence test, we find the relation with AFQT shown in Figure 7 of the
Appendix. The AQE-66 data are now much more spread out, with a variance
comparable to that obtained from the choice administration of AQE-66. The
correlation in this case is .81, which is somewhat above that of .76 found for
the relation between AQE-66 administered as a choice test and AFQT. Thus, the
administration of AQE-66 as a confidence test can increase its predictive
validity.
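
The validity coefficients quoted above (.76, .70, and .81) are ordinary
product-moment correlations between a predictor score and the criterion score.
A minimal Python sketch of that computation follows. The score lists are toy
numbers invented for illustration and are not data from the 40 airmen.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# TOY DATA (hypothetical): predictor scores and criterion scores.
predictor = [52, 61, 70, 75, 88, 93]
criterion = [40, 55, 50, 68, 72, 90]
validity = pearson_r(predictor, criterion)
```

With only 40 examinees, a difference between two such coefficients is subject
to considerable sampling variation, which is consistent with the report
describing this result as a hint rather than a demonstration.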


DISCUSSION AND CONCLUSIONS

The results of this study do not prove that confidence testing yields
improvements in personnel selection, classification and placement. It would be
unreasonable to expect this from such an experiment. There is, on the other
hand, nothing in these data to deny the possibility that confidence testing can
radically improve the testing process.

This small scale study may be viewed as setting up a series of hurdles for
confidence testing to pass, and pass them it did. First, the airmen did not get
mired down in a sea of confusion about the procedures of confidence testing.
In general, they responded with remarkable intelligence and yielded test data
which contain information over and above that which it is possible to get when
AQE-66 is administered as a choice test.

These data suggest that there are wide individual differences in the way in
which airmen process and evaluate information. These differences are reflected
in their test scores, but until there is reason to believe that this is not
characteristic of their behavior outside of this particular test situation there
should be no cause for concern. In fact, this may prove to be a new source of
true variance which would serve to improve the validity of test data.

There was no dearth of guessing situations encountered by these airmen in
responding to AQE-66 even though it is a highly refined test designed to
minimize guessing. This gave confidence testing an opportunity to reduce one
source of error variance in AQE-66 and this was reflected in the higher
reliability and intrabattery correlations and in the improved correlation
between AQE-66 and AFQT.


REFERENCES 


de Finetti, B. Does it make sense to speak of good probability appraisers?
In I. J. Good (Gen. Ed.) The scientist speculates. New York: Basic Books,
1962. Pp. 357-364.

Massengill, H. E. & Shuford, E. H. (1966) Decision-theoretic psychometrics:
a logical analysis of guessing. Lexington, Massachusetts: The Shuford-
Massengill Corporation.

Massengill, H. E. & Shuford, E. H. (1967) What pupils and teachers should
know about guessing. Lexington, Massachusetts: The Shuford-Massengill
Corporation.

Massengill, H. E. & Shuford, E. H. (1968) A report on the effect of degree
of confidence in student testing. Lexington, Massachusetts: The Shuford-
Massengill Corporation.

Roby, T. B. (1965) Belief states: a preliminary empirical study. ESD-TDR-
64-233, Decision Sciences Laboratory, L. G. Hanscom Field, Bedford,
Massachusetts.

Shuford, E. H., Albert, A. & Massengill, H. E. (1966) Admissible probability
measurement procedures. Psychometrika, 31, 125-145.

Shuford, E. H. & Massengill, H. E. (1966a) Decision-theoretic psychometrics:
the effect of guessing on the quality of personnel and counseling decisions.
Lexington, Massachusetts: The Shuford-Massengill Corporation.

Shuford, E. H. & Massengill, H. E. (1966b) Decision-theoretic psychometrics:
the worth of individualizing instruction. Lexington, Massachusetts: The
Shuford-Massengill Corporation.

Shuford, E. H. & Massengill, H. E. (1967a) The relative effectiveness of
five instructional strategies. Lexington, Massachusetts: The Shuford-
Massengill Corporation.

Shuford, E. H. & Massengill, H. E. (1967b) How to shorten a test and
increase its reliability and validity. Lexington, Massachusetts: The
Shuford-Massengill Corporation.

Shuford, E. H. & Massengill, H. E. (1967c) Individual and social justice
in objective testing. Lexington, Massachusetts: The Shuford-Massengill
Corporation.

Toda, M. (1963) Measurement of subjective probability distributions. ESD-
TDR-63-407, Decision Sciences Laboratory, L. G. Hanscom Field, Bedford,
Massachusetts.

van Naerssen, R. F. (1961) A scale for the measurement of subjective
probability. Acta Psychologica, 159-166.

Vitola, B. M. & Madden, H. L. (1967) Development and standardization of
Airman Qualifying Examination-66. Personnel Research Laboratory, Lackland
Air Force Base, Texas.




Table 2. Response Frequency, and Relative Frequency of Answer Being Correct
as a Function of Degree of Confidence.

Entries based on all 61 airmen. Right hand number of entry shows frequency
with which degree of confidence was used while left hand entry shows frequency
with which that degree of confidence was assigned to a correct answer.


Degree of        Response          Relative
Confidence       Frequency         Frequency

    A            1,519/26,243        .058
    B               16/130           .123
    C               28/159           .176
    D               77/431           .179
    E              312/1,503         .208
    F              845/4,240         .199
    G              355/1,664         .213
    H              129/542           .238
    I              142/563           .252
    J               64/261           .245
    K               48/183           .268
    L               52/172           .302
    M              357/1,056         .338
    N              100/315           .317
    O               36/133           .271
    P               33/69            .478
    Q               33/50            .660
    R               28/56            .580
    S               23/49            .469
    T               27/47            .574
    U               28/43            .651
    V               24/32            .750
    W               15/26            .577
    X                8/13            .615
    Y               16/18            .889
    Z            4,481/5,584         .802
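
As the note to Table 2 describes, each relative frequency is the left hand
number of an entry (times the degree of confidence was assigned to a correct
answer) divided by the right hand number (times it was used at all). A minimal
Python sketch of that arithmetic for three rows of the table; the function and
variable names are illustrative only:

```python
def relative_frequency(correct: int, used: int) -> float:
    """Share of uses of a degree of confidence that fell on a correct answer,
    rounded to three places as in Table 2."""
    return round(correct / used, 3)


# (assigned to a correct answer, times used) for rows A, B, and Z of Table 2.
entries = {"A": (1519, 26243), "B": (16, 130), "Z": (4481, 5584)}
frequencies = {
    degree: relative_frequency(correct, used)
    for degree, (correct, used) in entries.items()
}
```

If airmen's stated confidence tracked their actual knowledge, these relative
frequencies would rise steadily from degree A to degree Z.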

[Table residue. Recoverable column headings: AIRMAN; PERCENT "A" ANSWERS
CORRECT; PERCENT "Z" ANSWERS CORRECT; INFERRED PERCENT CORRECT ANSWERS;
ORIGINAL PERCENT CORRECT ANSWERS. Entries not recoverable.]


[Figure residue. Axis label: ADMINISTRATIVE APTITUDE SCORE.]

Figure 2b. Relation between General and Mechanical [...] on number of correct
answers during original administration [...] 40 airmen.
[Axis label: MECHANICAL APTITUDE SCORE (ORIGINAL CHOICE SCORE).]

Figure 6. Relation between Air Force Qualifying Test Score and Valid
Confidence Score for 150 multiple-choice items from experimental
administration of AQE-66 to 40 airmen.
[Axis labels: AIRMAN QUALIFYING EXAMINATION-66 (VALID CONFIDENCE SCORE);
AIR FORCE QUALIFYING TEST.]

UNCLASSIFIED

DOCUMENT CONTROL DATA - R & D

Originating activity:        The Shuford-Massengill Corporation
                             One Wallis Court
                             Lexington, Massachusetts 02175
Security classification:     UNCLASSIFIED
Report title:                AIRMAN QUALIFYING EXAMINATION-66 ADMINISTERED AS
                             A CONFIDENCE TEST
Authors:                     Emir H. Shuford, Jr. & H. Edward Massengill, Jr.
Report date:                 May 1968
Total no. of pages:          13
No. of refs:                 14
Contract or grant no.:       AF 49(638)-1744
Project no.:                 920F-9719
Originator's report number:  SMC R-12
Other report no.:            AFOSR 68-2162
Distribution statement:      1. This document has been approved for public
                             release and sale; its distribution is unlimited.
Sponsoring military activity:
                             Air Force Office of Scientific Research
                             1400 Wilson Boulevard (SRLB)
                             Arlington, Virginia 22209

Abstract:

Airman Qualifying Examination-66 was readministered as a Valid Confidence test
to 61 basic airmen.

Airmen understood the method of confidence testing and yielded data containing
information over and above that available from choice testing. There is no
evidence that confidence testing disrupted the test-taking process.

Wide individual differences were observed in airmen's ability to evaluate
information. Observed patterns of confidence indicate that airmen would be
guessing on about one-fourth of the items if AQE-66 had been administered as a
choice test.

Confidence test administration served to increase reliability of AQE-66 to the
extent that it was equivalent to a choice test about twice as long as the
current test.

Confidence test administration served to increase predictive validity of AQE-66
as measured by the correlation between AQE-66 and AFQT.

DD FORM 1473

UNCLASSIFIED


